CROSS-REFERENCES TO RELATED APPLICATIONS

This application claims priority from U.S. provisional patent
application number 60/035,205 filed 01/10/97, which is 
incorporated herein by reference. 

STATEMENT REGARDING GOVERNMENT SUPPORT 

This invention was supported in part by the National Science 
Foundation grant number IRI-9411306-4. The Government has
certain rights in the invention. 

FIELD OF THE INVENTION 

This invention relates generally to techniques for analyzing 
linked databases. More particularly, it relates to methods for
assigning ranks to nodes in a linked database, such as any
database of documents containing citations, the world wide web,
or any other hypermedia database.

BACKGROUND OF THE INVENTION 

Due to the developments in computer technology and its increase 
in popularity, large numbers of people have recently started to 
frequently search huge databases. For example, internet search
engines are frequently used to search the entire world wide web.
Currently, a popular search engine might execute over 30 million
searches per day of the indexable part of the web, which has a
size in excess of 500 Gigabytes. Information retrieval systems
are traditionally judged by their precision and recall. What is
often neglected, however, is the quality of the results produced
by these search engines. Large databases of documents such as 
the web contain many low quality documents. As a result, 
searches typically return hundreds of irrelevant or unwanted 
documents which camouflage the few relevant ones. In order to 
improve the selectivity of the results, common techniques allow 
the user to constrain the scope of the search to a specified
subset of the database, or to provide additional search terms. 
These techniques are most effective in cases where the database 
is homogeneous and already classified into subsets, or in cases 
where the user is searching for well known and specific 
information. In other cases, however, these techniques are 
often not effective because each constraint introduced by the 
user increases the chances that the desired information will be 
inadvertently eliminated from the search results. 

Search engines presently use various techniques that attempt to 
present more relevant documents. Typically, documents are 
ranked according to variations of a standard vector space model. 
These variations could include (a) how recently the document was 
updated, and/or (b) how close the search terms are to the 
beginning of the document. Although this strategy provides 
search results that are better than with no ranking at all, the 
results still have relatively low quality. Moreover, when
searching the highly competitive web, this measure of relevancy
is vulnerable to "spamming" techniques that authors can use to
artificially inflate their document's relevance in order to draw
attention to it or its advertisements. For this reason search
results often contain commercial appeals that should not be
considered a match to the query. Although search engines are
designed to avoid such ruses, poorly conceived mechanisms can 
result in disappointing failures to retrieve desired 
information.

Hyperlink Search Engine, developed by IDD Information Services
(http://rankdex.gari.com/), uses backlink information (i.e.,
information from pages that contain links to the current page)
to assist in identifying relevant web documents. Rather than
using the content of a document to determine relevance, the
technique uses the anchor text of links to the document to
characterize the relevance of a document. The idea of
associating anchor text with the page the text points to was
first implemented in the World Wide Web Worm (Oliver A. McBryan,
GENVL and WWWW: Tools for Taming the Web, First International
Conference on the World Wide Web, CERN, Geneva, May 25-27,
1994). The Hyperlink Search Engine has applied this idea to
assist in determining document relevance in a search. In
particular, search query terms are compared to a collection of
anchor text descriptions that point to the page, rather than to
a keyword index of the page content. A rank is then assigned to
a document based on the degree to which the search terms match
the anchor descriptions in its backlink documents.

The well known idea of citation counting is a simple method for
determining the importance of a document by counting its number 
of citations, or backlinks. The citation rank r(A) of a document 
which has n backlink pages is simply 

r (A) = n. 

In the case of databases whose content is of relatively uniform
quality and importance it is valid to assume that a highly cited
document should be of greater interest than a document with only
one or two citations. Many databases, however, have extreme
variations in the quality and importance of documents. In these
cases citation ranking is overly simplistic. For example,
citation ranking will give the same rank to a document that is
cited once on an obscure page as to a similar document that is
cited once on a well-known and highly respected page.
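
As a minimal sketch of this limitation, citation counting can be written in a few lines of Python. The page names and the link list below are hypothetical and serve only to illustrate that both documents receive the same citation rank regardless of which page cites them.

```python
from collections import Counter

def citation_rank(links):
    """Return r(A) = n, the number of backlinks, for every cited document.

    `links` is a list of (citing_page, cited_page) pairs.
    """
    return Counter(cited for _citing, cited in links)

# Hypothetical data: X is cited once by an obscure page,
# Y is cited once by a well-known page.
links = [("obscure_page", "X"), ("well_known_page", "Y")]
ranks = citation_rank(links)
print(ranks["X"], ranks["Y"])  # both counts are 1
```

The count carries no information about the citing page, which is the shortcoming described above.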

It is an object of the present invention to provide a method for
ranking documents in a linked database. It
is another object of the invention to provide such a method that
provides an objective ranking based on the relationship between
documents. Another object of the invention is to provide a
technique for ranking documents within a database whose content
has a large variation in quality and importance. Another object
of the present invention is to provide a document ranking method
that is scalable and can be applied to extremely large databases
such as the world wide web. Additional objects and advantages
will become apparent in view of the following description and
drawings.

The present invention achieves the above objects by taking
advantage of the linked structure of a database to assign a rank
to each document in the database, where the document rank is a
measure of the importance of a document. Rather than
determining relevance from the intrinsic content of a document,
or from the anchor text of backlinks to the document, the
present method determines importance from the extrinsic
relationships between documents. Intuitively, a document should
be important (regardless of its content) if it is highly cited
by other documents. Not all citations, however, are of equal
significance. A citation from an important document is more
important than a citation from a relatively unimportant
document. Thus, the importance of a page, and hence the rank
assigned to it, should depend not just on the number of
citations it has, but on the importance of the citing documents
as well. This implies a recursive definition of rank: the rank
of a document is a function of the ranks of the documents that
cite it. The ranks can be calculated by an iterative procedure
on a linked database.



Because citations, or links, are ways of directing attention, 
the important documents correspond to those documents to which 
the most attention is directed. Thus, a high rank indicates 
that a document is considered valuable by many people or by 
important people. Most likely, these are the pages to which 
someone performing a search would like to direct his or her
attention. Looked at another way, the importance of a page is
directly related to the steady-state probability that a random
web surfer ends up at the page after following a large number of
links. Because there is a larger probability that a surfer will
end up at an important page than at an unimportant page, this
method of ranking pages assigns higher ranks to the more
important pages.




In one aspect, a method is provided for calculating an
importance rank for N linked nodes of a linked database. The
method comprises the steps of:

(a) selecting an initial N-dimensional vector $p_0$;

(b) computing an approximation $p_n$ to a steady-state probability
$p_\infty$ in accordance with the equation $p_n = A^n p_0$, where A is
an NxN transition probability matrix having elements A[i][j]
representing a probability of moving from node i to node j; and

(c) determining a rank r[k] for a node k from a k-th component
of $p_n$.

[...] where each of the backlink nodes is weighted in dependence
upon [...] the importance rank of a node is calculated, in part,
from a constant α representing the probability that a surfer will
randomly jump to the node. The importance rank of a node can
also be calculated, in part, from a measure of distances between
the node and backlink nodes of the node. The initial
N-dimensional vector $p_0$ may be selected to represent a uniform
probability distribution, or a non-uniform probability
distribution which gives weight to a predetermined set of nodes.
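
Steps (a) through (c) can be sketched as follows. This is an illustrative sketch only: the two-node matrix is hypothetical, and the update is written as p[j] = sum over i of p[i] * A[i][j], which is one reading of $p_n = A^n p_0$ when A[i][j] is taken to mean "from node i to node j".

```python
def approximate_ranks(A, p0, n):
    """Start from p0, apply the transition matrix n times, and return the
    resulting vector, whose k-th component is the rank of node k."""
    N = len(p0)
    p = list(p0)
    for _ in range(n):
        # Probability of being at node j after one more step.
        p = [sum(p[i] * A[i][j] for i in range(N)) for j in range(N)]
    return p

# Hypothetical two-node example: node 0 always moves to node 1,
# node 1 moves to either node with equal probability.
A = [[0.0, 1.0],
     [0.5, 0.5]]
p0 = [0.5, 0.5]                       # (a) uniform initial vector
pn = approximate_ranks(A, p0, 50)     # (b) approximation of the steady state
r = {k: pn[k] for k in range(2)}      # (c) rank of node k
print(r)
```

For this small chain the steady state is 1/3 for node 0 and 2/3 for node 1, so the sketch can be checked by hand.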



BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram of the relationship between three linked
hypertext documents according to the invention.

FIG. 2 is a diagram of a three-document web illustrating the
rank associated with each document in accordance with the
present invention.



DETAILED DESCRIPTION 

Although the following detailed description contains many
specifics for the purposes of illustration, anyone of ordinary
skill in the art will appreciate that many variations and
alterations to the following details are within the scope of the
invention. Accordingly, the following preferred embodiment of
the invention is set forth without any loss of generality to,
and without imposing limitations upon, the claimed invention.
For support in reducing the present invention to practice, the
inventor acknowledges Sergey Brin, Scott Hassan, Rajeev Motwani,
Alan Steremberg, and Terry Winograd.

A linked database (i.e. any database of documents containing
mutual citations, such as the world wide web or other hypermedia
archive, a dictionary or thesaurus, and a database of academic
articles, patents, or court cases) can be represented as a
directed graph of N nodes, where each node corresponds to a web
page document and where the directed connections between nodes
correspond to links from one document to another. A given node
has a set of forward links that connect it to children nodes,
and a set of backward links that connect it to parent nodes.
FIG. 1 shows a typical relationship between three hypertext
documents A, B, and C. As shown in this particular figure, the
first links in documents B and C are pointers to document A. In
this case we say that B and C are backlinks of A, and that A is
a forward link of B and of C. Documents B and C also have other
forward links to documents that are not shown. 

Although the ranking method of the present invention is 
superficially similar to the well known idea of citation 
counting, the present method is more subtle and complex than 
citation counting and gives far superior results. In a simple 
citation ranking, the rank of a document A which has n backlink 
pages is simply 

r (A) = n. 



According to one embodiment of the present method of ranking,
the backlinks from different pages are weighted differently and
the number of links on each page is normalized. More precisely,
the rank of a page A is defined according to the present
invention as

$$ r(A) = \frac{\alpha}{N} + (1-\alpha)\left(\frac{r(B_1)}{|B_1|} + \cdots + \frac{r(B_n)}{|B_n|}\right), $$



where $B_1, \ldots, B_n$ are the backlink pages of A, $r(B_1), \ldots, r(B_n)$
are their ranks, $|B_1|, \ldots, |B_n|$ are their numbers of forward
links, α is a constant in the interval [0,1], and N is the
total number of pages in the web. This definition is clearly
more complicated and subtle than the simple citation rank. Like
the citation rank, this definition yields a page rank that
increases as the number of backlinks increases. But the present
method considers a citation from a highly ranked backlink as
more important than a citation from a lowly ranked backlink
(provided both citations come from backlink documents that have
an equal number of forward links). In the present invention, it
is possible, therefore, for a document with only one backlink
(from a very highly ranked page) to have a higher rank than
another document with many backlinks (from very low ranked
pages). This is not the case with simple citation ranking.
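
The definition can be evaluated directly for a single page. The following is a sketch with hypothetical rank values and link counts, chosen only to show a page with one strong backlink outranking a page with several weak ones.

```python
def page_rank_of(backlinks, alpha, N):
    """Evaluate r(A) = alpha/N + (1 - alpha) * sum(r(B_i) / |B_i|).

    `backlinks` is a list of (rank_of_B_i, forward_link_count_of_B_i) pairs.
    """
    return alpha / N + (1 - alpha) * sum(r / n_links for r, n_links in backlinks)

# Hypothetical numbers: one backlink from a highly ranked page
# versus three backlinks from low-ranked pages.
one_strong = page_rank_of([(0.30, 2)], alpha=0.1, N=100)
many_weak = page_rank_of([(0.01, 2), (0.01, 2), (0.01, 2)], alpha=0.1, N=100)
print(one_strong > many_weak)  # True
```

Under simple citation counting the second page would rank higher (three citations against one).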

The ranks form a probability distribution over web pages, so 
that the sum of ranks over all web pages is unity. The rank of 
a page can be interpreted as the probability that a surfer will 
be at the page after following a large number of forward links. 
The constant α in the formula is interpreted as the probability
that the web surfer will jump randomly to any web page instead
of following a forward link. The page ranks for all the pages
can be calculated using a simple iterative algorithm, and
correspond to the principal eigenvector of the normalized link
matrix of the web, as will be discussed in more detail below.

In order to illustrate the present method of ranking, consider 
the simple web of three documents shown in FIG. 2. For 
simplicity of illustration, we assume in this example that α=0.
Document A has a single backlink to document C, and this is the
only forward link of document C, so

r(A) = r(C).

Document B has a single backlink to document A, but this is one
of two forward links of document A, so

r(B) = r(A)/2.

Document C has two backlinks. One backlink is to document B,
and this is the only forward link of document B. The other
backlink is to document A via the other of the two forward links
from A. Thus

r(C) = r(B) + r(A)/2.

In this simple illustrative case we can see by inspection that
r(A) = 0.4, r(B) = 0.2, and r(C) = 0.4. Although a typical value
for α is approximately 0.1, if for simplicity we set α = 0.5 (which
corresponds to a 50% chance that a surfer will randomly jump to
one of the three pages rather than following a forward link),
then the mathematical relationships between the ranks become
more complicated. In particular, we then have

r(A) = 1/6 + r(C)/2,
r(B) = 1/6 + r(A)/4, and
r(C) = 1/6 + r(A)/4 + r(B)/2.

The solution in this case is r(A) = 14/39, r(B) = 10/39, and
r(C) = 15/39.
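
Both stated solutions can be checked with exact arithmetic. The sketch below substitutes them into the three rank equations for the web of FIG. 2 (A links to B and C, B links to C, C links to A); the function name `residuals` is introduced here for illustration only.

```python
from fractions import Fraction as F

def residuals(rA, rB, rC, alpha):
    """Return how far (rA, rB, rC) is from satisfying the three rank
    equations of the FIG. 2 web (A -> B, A -> C, B -> C, C -> A)."""
    jump = alpha / 3          # alpha / N with N = 3
    follow = 1 - alpha
    return (
        rA - (jump + follow * rC),              # C's only link goes to A
        rB - (jump + follow * rA / 2),          # one of A's two links
        rC - (jump + follow * (rA / 2 + rB)),   # A's other link plus B's only link
    )

print(residuals(F(2, 5), F(1, 5), F(2, 5), F(0)))            # every residual is zero
print(residuals(F(14, 39), F(10, 39), F(15, 39), F(1, 2)))   # every residual is zero
```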

In practice, there are millions of documents and it is not 
possible to find the solution to a million equations by 
inspection. Accordingly, in the preferred embodiment a simple 
iterative procedure is used. As the initial state we may simply 
set all the ranks equal to 1/N. The formulas are then used to 
calculate a new set of ranks based on the existing ranks. In 
the case of millions of documents, sufficient convergence 
typically takes on the order of 100 iterations. It is not 
always necessary or even desirable, however, to calculate the
rank of every page with high precision. Even approximate rank
values, using two or more iterations, can provide very valuable,
or even superior, information.
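
The iterative procedure can be sketched as follows for the three-document web of FIG. 2. This is a minimal illustration, not the preferred implementation: it assumes every page has at least one forward link, and the dictionary representation of the web is a choice made here for brevity.

```python
def iterate_ranks(forward_links, alpha, iterations):
    """Start with every rank equal to 1/N and repeatedly recompute each rank
    from the current ranks of its backlink pages."""
    pages = list(forward_links)
    N = len(pages)
    ranks = {page: 1.0 / N for page in pages}
    for _ in range(iterations):
        new_ranks = {page: alpha / N for page in pages}
        for page, targets in forward_links.items():
            # Each page passes an equal share of its rank along every forward link.
            share = (1 - alpha) * ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rough = iterate_ranks(web, alpha=0.5, iterations=2)
converged = iterate_ranks(web, alpha=0.5, iterations=100)
print(rough)
print(converged)
```

After two iterations the ranks are 0.375, 0.25, and 0.375, already close to the exact values 14/39, 10/39, and 15/39 that the longer run approaches.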

The iteration process can be understood as a steady-state
probability distribution calculated from a model of a random
surfer. This model is mathematically equivalent to the
explanation described above, but provides a more direct and
concise characterization of the procedure. The model includes
(a) an initial N-dimensional probability distribution vector $p_0$
where each component $p_0[i]$ gives the initial probability that a
random surfer will start at a node i, and (b) an NxN transition
probability matrix A where each component A[i][j] gives the
probability that the surfer will move from node i to node j.
The probability distribution of the graph after the surfer
follows one link is $p_1 = A p_0$, and after two links the
probability distribution is $p_2 = A p_1 = A^2 p_0$. Assuming this
iteration converges, it will converge to a steady-state
probability

$$ p_\infty = \lim_{n \to \infty} A^n p_0, $$

which is a dominant eigenvector of A. The iteration circulates 
the probability through the linked nodes like energy flows 
through a circuit and accumulates in important places. Because 
pages with no links occur in significant numbers and bleed off 
energy they cause some complication with computing the ranking. 
This complication is caused by the fact they can add huge 
amounts to the "random jump" factor. This, in turn, causes
loops in the graph to be highly emphasized which is not
generally a desirable property of the model. In order to
address this problem, these childless pages can simply be
removed from the model during the iterative stages, and added
back in after the iteration is complete. After the childless
pages are added back in, however, the same number of iterations
that was required to remove them should be done to make sure
they all receive a value. (Note that in order to ensure
convergence, the norm of $p_i$ must be made equal to 1 after each
iteration.) An alternate method to control the contribution of
the childless nodes is to only estimate the steady state by
iterating a small number of times.
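
The removal of childless pages can be sketched as below. The example web is hypothetical, and counting the removal passes is an assumption about how "the same number of iterations that was required to remove them" might be tracked; removing one childless page can leave another page childless, so several passes may be needed.

```python
def strip_childless(forward_links):
    """Repeatedly remove pages that have no forward links.

    Returns the pruned link structure and the number of removal passes.
    """
    links = {page: set(targets) for page, targets in forward_links.items()}
    passes = 0
    while True:
        childless = {page for page, targets in links.items() if not targets}
        if not childless:
            return links, passes
        passes += 1
        # Drop the childless pages and every link that pointed to them.
        links = {
            page: targets - childless
            for page, targets in links.items()
            if page not in childless
        }

# Hypothetical web: D has no forward links, and E links only to D.
web = {"A": ["B", "E"], "B": ["A"], "D": [], "E": ["D"]}
pruned, passes = strip_childless(web)
print(pruned, passes)
```

Here D is removed in the first pass and E in the second, leaving only A and B for the iterative stage.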

The rank r[i] of a node i can then be defined as a function of
this steady-state probability distribution. For example, the
rank can be defined simply by $r[i] = p_\infty[i]$. This method of
calculating rank is mathematically equivalent to the iterative
method described first. Those skilled in the art will
appreciate that this same method can be characterized in various
different ways that are mathematically equivalent. Such
characterizations are obviously within the scope of the present
invention. Because the rank of various different documents can
vary by orders of magnitude, it is convenient to define a
logarithmic rank

$$ r[i] = \log_{10} \frac{p_\infty[i]}{\min_{k \in [1,N]} p_\infty[k]}, $$

which assigns a rank of 0 to the lowest ranked node and
increases by 1 for each order of magnitude in importance higher
than the lowest ranked node.
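
A short sketch of the logarithmic rank follows. It assumes a base-10 logarithm, consistent with "order of magnitude" above, and strictly positive steady-state values; the sample probabilities are hypothetical.

```python
import math

def logarithmic_ranks(p):
    """Map steady-state probabilities to log10(p[i] / min(p))."""
    lowest = min(p)
    return [math.log10(value / lowest) for value in p]

# Hypothetical steady-state values spanning three orders of magnitude.
log_ranks = logarithmic_ranks([0.0009, 0.009, 0.09, 0.9])
print(log_ranks)
```

The lowest ranked node receives 0 and each tenfold increase in probability adds approximately 1.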

[...] performed to approximate $p_\infty$. The initial distribution can be
selected to be uniform or non-uniform. A uniform distribution
would set each component of $p_0$ equal to 1/N. A non-uniform
distribution, for example, can divide the initial probability
[...]

In a preferred embodiment, the transition matrix A is given by

$$ A = \frac{\alpha}{N}\,\mathbf{1} + (1-\alpha)\,B, $$

where $\mathbf{1}$ is an NxN matrix consisting of all 1s, α is the
probability that a surfer will jump randomly to any one of the N
nodes, and B is a matrix whose elements B[i][j] are given by

$$ B[i][j] = \begin{cases} \dfrac{1}{n_i} & \text{if node } i \text{ points to node } j \\ 0 & \text{otherwise,} \end{cases} $$

where $n_i$ is the total number of forward links from node i. The
(1-α) factor acts as a damping factor that limits the extent to
which a document's rank can be inherited by children documents.
This models the fact that users typically jump to a different
place in the web after following a few links. The value of α is
typically around 15%. Including this damping is important when
many iterations are used to calculate the rank so that there is
no artificial concentration of rank importance within loops of
the web. Alternatively, one may set α=0 and only iterate a few
times in the calculation.
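
The construction of the transition matrix can be sketched as follows, using the three-document web of FIG. 2 with documents A, B, and C numbered 0, 1, and 2. The list-of-lists representation is an illustrative choice, and every node is assumed to have at least one forward link.

```python
def transition_matrix(forward_links, alpha):
    """Build A = (alpha / N) * ones + (1 - alpha) * B as a list of rows.

    `forward_links[i]` lists the nodes that node i points to.
    """
    N = len(forward_links)
    A = [[alpha / N] * N for _ in range(N)]
    for i, targets in enumerate(forward_links):
        for j in targets:
            A[i][j] += (1 - alpha) / len(targets)   # B[i][j] = 1 / n_i
    return A

# The three-document web with A, B, C numbered 0, 1, 2.
A = transition_matrix([[1, 2], [2], [0]], alpha=0.15)
row_sums = [sum(row) for row in A]   # each row is a probability distribution
print(A)
print(row_sums)
```

Each row sums to 1 because the random-jump term contributes α in total and the link-following term contributes 1-α.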

[...] purposes. As already mentioned above, [...]
sites by changing the $\mathbf{1}$ matrix to another matrix. For example,
it could be distributed so that a random jump takes the surfer
to one of a few nodes that have a high importance, and will not
take the surfer to any of the other nodes. This can be very
effective in preventing deceptively tagged documents from
receiving artificially inflated relevance. Alternatively, the
random linking probability could be distributed so that random
jumps do not happen from high importance nodes, and only happen
from other nodes. This distribution would model a surfer who is
more likely to make random jumps from unimportant sites and
follow forward links from important sites. A modification to
avoid drawing unwanted attention to pages with artificially
inflated relevance is to ignore local links between documents
and only consider links between separate domains. Because the
links from other sites to the document are not directly under
the control of a typical web site designer, it is then difficult
for the designer to artificially inflate the ranking. A simpler
approach is to weight links from pages contained on the same web
server less than links from other servers. Also, in addition to
servers, internet domains and any general measure of the
distance between links could be used to determine such a
weighting.
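
One way to express the server-based weighting is sketched below. The function name `link_weight`, its `same_server_weight` parameter, and the URLs are hypothetical; comparing host names is only one possible measure of the distance between links.

```python
from urllib.parse import urlsplit

def link_weight(source_url, target_url, same_server_weight=0.0):
    """Weight a link by whether it crosses from one server to another.

    A weight of 0.0 ignores local links entirely; a value between 0 and 1
    merely counts them less than links from other servers.
    """
    same_server = urlsplit(source_url).hostname == urlsplit(target_url).hostname
    return same_server_weight if same_server else 1.0

# Hypothetical URLs for illustration.
local = link_weight("http://example.com/a.html", "http://example.com/b.html")
remote = link_weight("http://example.org/x.html", "http://example.com/b.html")
print(local, remote)
```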

Additional modifications can further improve the performance of
this method. Rank can be increased for documents whose backlinks 
are maintained by different institutions and authors in various 
geographic locations. Or it can be increased if links come from 
unusually important web locations such as the root page of a 
domain. 

Links can also be weighted by their relative importance within a 
document. For example, highly visible links that are near the 
top of a document can be given more weight. Also, links that are
in large fonts or emphasized in other ways can be given more
weight. In this way, the model better approximates human usage
and authors' intentions. In many cases it is appropriate to 
assign higher value to links coming from pages that have been 
modified recently since such information is less likely to be 
obsolete. 

The present method has the advantage that the convergence is
very fast (a few hours using current processors) and it is much
less expensive than building a full-text index. This speed
allows the ranking to be customized or personalized for specific
users. For example, a user's home page and/or bookmarks can be
given a large initial importance, and/or a high probability of a
random jump returning to it. This high rating essentially
indicates to the system that the person's homepage and/or
bookmarks does indeed contain subjects of importance that should
be highly ranked. This procedure essentially trains the system
to recognize pages related to the person's interests.

The present method of determining the rank of a document can
also be used to enhance the display of documents. In
particular, each link in a document can be annotated with an
icon, text, or other indicator of the rank of the document that
each link points to. Anyone viewing the document can then
easily see the relative importance of various links in the
document.
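
The personalization described above can be sketched by replacing the uniform random jump with a per-page jump distribution. In this illustration the `jump` dictionary, the choice of document B as the user's home page, and the value α = 0.5 are all hypothetical; the web is the three-document web of FIG. 2.

```python
def personalized_ranks(forward_links, jump, alpha, iterations):
    """Iterate ranks when a random jump lands on page k with probability
    jump[k] instead of the uniform 1/N."""
    ranks = dict(jump)                       # start from the jump distribution
    for _ in range(iterations):
        new_ranks = {page: alpha * jump[page] for page in forward_links}
        for page, targets in forward_links.items():
            for target in targets:
                new_ranks[target] += (1 - alpha) * ranks[page] / len(targets)
        ranks = new_ranks
    return ranks

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
uniform = personalized_ranks(web, {"A": 1/3, "B": 1/3, "C": 1/3}, 0.5, 100)
# Hypothetical user whose home page is B: every random jump returns to B.
personal = personalized_ranks(web, {"A": 0.0, "B": 1.0, "C": 0.0}, 0.5, 100)
print(uniform["B"], personal["B"])
```

With the uniform jump, B settles near 10/39; when every jump returns to B, its rank rises to about 7/13.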

The present method of ranking documents in a database can also 
be useful for estimating the amount of attention any document 
receives on the web since it models human behavior when surfing
the web. Estimating the importance of each backlink to a page
can be useful for many purposes including site design, business
arrangements with the backlinkers, and marketing. The effect of
potential changes to the hypertext structure can be evaluated by
adding them to the link structure and recomputing the ranking.
Real usage data, when available, can be used as a starting point
for the model and as the distribution for the alpha factor.
This can allow this ranking model to fill holes in the usage
data and provide a more accurate or comprehensive picture.
Thus, although this method of ranking does not necessarily match
the actual traffic, it nevertheless measures the degree of
exposure a document has throughout the web.

One application of the present
technique is to enhance the quality of results from web search
engines. In this application of the present invention, the
ranking method of the invention is integrated into a web search
engine to produce results far superior to existing methods in
quality and performance. A search engine employing the ranking
method of the present invention has all the advantages of
automation while producing results comparable to a human
maintained categorized system. In this approach, a web crawler
explores the web and creates an index of the web content, as
well as a directed graph of nodes corresponding to the structure
of hyperlinks. The nodes of the graph (i.e. pages of the web)
are then ranked according to importance according to the method
of the present invention.

[...] documents that match the
specified search criteria, either by searching full text, or by
searching titles only. In addition, the search can include the
anchor text associated with backlinks to the page. This idea
has several advantages in this context. First, anchors often
provide more accurate descriptions of web pages than the pages
themselves. Second, anchors may exist for images, programs, and
other objects that cannot be indexed by a text-based search
engine. This also makes it possible to return web pages which
[...]. In addition, the engine can
compare the search terms with a list of its backlink document
titles. Thus, even though the text of the document itself may
not match the search terms, if the document is cited by
documents whose titles or backlink anchor text match the search
terms, the document will be considered a match. In addition to
or instead of the anchor text, the text in the immediate
vicinity of the backlink anchor text can also be compared to the
search terms in order to improve the search.
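
Matching against backlink anchor text can be sketched as follows. The function name, the whitespace tokenization, and the example URLs are assumptions made for illustration; a real engine would use its own index structures.

```python
def match_by_anchor_text(query, anchors):
    """Return the pages whose backlink anchor text contains every query term.

    `anchors` is a list of (target_page, anchor_text) pairs collected from
    the pages that link to each target.
    """
    terms = set(query.lower().split())
    words_for = {}
    for target, text in anchors:
        words_for.setdefault(target, set()).update(text.lower().split())
    return sorted(page for page, words in words_for.items() if terms <= words)

# Hypothetical data: an image has no indexable text of its own,
# but the anchors pointing to it describe it.
anchors = [
    ("http://example.com/map.gif", "campus map"),
    ("http://example.com/map.gif", "map of the campus"),
    ("http://example.com/home.html", "home page"),
]
hits = match_by_anchor_text("campus map", anchors)
print(hits)
```

The image is returned as a match even though its own content was never searched.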

[...] terms, the list of documents is then sorted with high ranking
documents first and low ranking documents last. The ranking in
this case is a function which combines all of the
above factors such as the objective ranking and textual
matching. If desired, the results can be grouped by category or
[...].




It will be clear to one skilled in the art that the above
embodiment may be altered in many ways without departing from
the scope of the invention. Accordingly, the scope of the
invention should be determined by the following claims and their
legal equivalents.